Stats for Data Science

Joint Mathematics Meetings, January 18, 2020

Modernizing

mod·ern·ize | ˈmädərˌnīz |

verb [with object]

adapt (something) to modern needs or habits, typically by installing modern equipment or adopting modern ideas or methods: a five-year plan to modernize Algerian agriculture.

Modernizing Needs

modern needs

prediction/classification
decision making/trade-offs
causality

pre-modern needs

detect some signal in noise: averaging
infer genetics from phenotypes: \(r\)
provide arbitration for claims: \(p\)

Modernizing Equipment

modern equipment

computers
databases
high connectivity

pre-modern equipment

mechanical calculators
static tables and tabulations
printing as distribution

Modernizing Methods

modern methods

wrangling & visualization
machine/statistical learning
randomization, bootstrapping, cross-validation
directed acyclic graphs
false discovery rates, …
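Of the methods listed above, bootstrapping is perhaps the quickest to show in code. The following is a minimal sketch, not a method taken from the talk: the data are simulated, and the helper name `bootstrap_interval` and its parameters are assumptions made for illustration. It resamples the data with replacement and reads a percentile interval off the resampled means.

```python
import random
import statistics

def bootstrap_interval(values, stat=statistics.mean, trials=2000,
                       level=0.95, seed=1):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n values with replacement, recompute the statistic each time.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(trials))
    lo_idx = int((1 - level) / 2 * trials)
    hi_idx = int((1 + level) / 2 * trials) - 1
    return stats[lo_idx], stats[hi_idx]

# Simulated sample (an assumption for illustration): mean 10, sd 2.
data_rng = random.Random(0)
sample = [data_rng.gauss(10, 2) for _ in range(100)]

low, high = bootstrap_interval(sample)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval: ({low:.2f}, {high:.2f})")
```

Any other statistic (a median, a slope) can be passed as `stat`, which is one reason resampling fits a course built around computing rather than formulas.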

pre-modern methods

see, e.g., most intro stats books
correlation/simple regression
histograms

Modernizing pedagogy

Many excellent recommendations from GAISE and other sources.

I have one more to add, that I think is critical:

Base teaching on what we know now, not on what was being invented in 1880–1910.

Ontogeny recapitulates phylogeny?

In biology: Growth and development of an individual follows the same path as the evolution of a species.

Should statistics teaching follow embryology?

In education: Should individual students follow the same path as statistics as a whole?

Bernoulli \(\Rightarrow\) Gauss \(\Rightarrow\)

Quetelet \(\Rightarrow\) Galton \(\Rightarrow\)

Pearson \(\Rightarrow\) Gosset \(\Rightarrow\)

Fisher \(\Rightarrow\)

Neyman-Pearson …

probability, means,

standard deviation,

correlation coefficient,

chi-squared,

t-test, “significant”, “p-value”, …

If statistics were automobiles … 1880s

1888 Francis Galton introduces the “co-relation” coefficient

1885 Karl Benz designs 4-stroke engine for use in his automobile

If statistics were automobiles … 1900–1910

1908 William Gosset’s t statistic

1908 First Model T off Henry Ford’s production line

If statistics were automobiles … 1920s

1927 Ford Model A enters production

1925 ANOVA appears in Fisher’s Statistical Methods for Research Workers

Where does following the historical path get us?

introductory course typically ends at or before “one-way” ANOVA
we neglect thinking constructively about causation
- we end up like Fisher in 1959 arguing that smoking doesn’t cause cancer
we avoid problems of prediction, researcher degrees of freedom, false discovery, evaluating trade-offs

Detours on the path to data science

We dip into historical coves and specialized techniques
- chi-squared, unequal-variance t-test, one-tailed tests, histogram, stem-and-leaf, box-and-whisker
We use historical vocabulary that can be off-putting or misleading
- standard deviation, standard error, margin of error, significance

Proposal: Stats for Data Science

Let’s work backward from current needs: prediction, decision-making, causality.

Meet these needs without worrying about phylogeny and history.

A curriculum, with only the essentials:

Data organization
Graphics
Models (generalize graphics to many variables)
- present “hypothesis testing” as aiding decision making when model building
… leaving time for covariates, causality, loss-functions & trade-offs, models that learn, …
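One way to read "hypothesis testing as aiding decision making when model building" is as a question about a single model term: does adding x to the model explain more than chance alone would? The sketch below uses a randomization (shuffling) test, which is a swapped-in technique chosen for illustration and not necessarily the one used in the course. The data are simulated, and `r_squared` and `permutation_p_value` are hypothetical helper names.

```python
import random

def r_squared(xs, ys):
    """R-squared of the straight-line model y ~ 1 + x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def permutation_p_value(xs, ys, trials=1000, seed=2):
    """Share of shuffles of x whose R-squared matches or beats the observed one."""
    rng = random.Random(seed)
    observed = r_squared(xs, ys)
    shuffled = list(xs)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between x and y
        if r_squared(shuffled, ys) >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (hits + 1) / (trials + 1)

# Simulated data (an assumption for illustration): y depends on x.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(60)]
ys_signal = [2 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

p_signal = permutation_p_value(xs, ys_signal)
decision = "keep x in the model" if p_signal < 0.05 else "drop x from the model"
print(f"p-value: {p_signal:.4f} -> {decision}")
```

The 0.05 cutoff is only a conventional placeholder; in a decision-making framing the threshold would follow from the costs of the two kinds of error.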

Computing essentials

How can we make computing accessible to everyone, both practically and intellectually?

Practical: Browser-based applications, web apps

Intellectual: Define a small set of essential, high-level skills.

Essential, high-level computing skills

1. Draw a point plot. Up to four variables: y, x, color, facet.
- use jittering and transparency
2. Construct a model: y ~ x + z and visualize it with (1).
- allow flexibility
- allow choice of architectures: machine-learning, bounded, unbounded.
3. Evaluate a model at two different inputs: effect size
4. Compare two models, e.g. y ~ 1 and y ~ 1 + x
- cross-validated prediction error
- F statistic
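The model-related skills above (construct, evaluate at two inputs, compare) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code behind the app: the data are simulated, the plotting step is omitted, and the helper names `fit_mean`, `fit_line`, `rss`, and `cv_error` are hypothetical.

```python
import random

def fit_mean(xs, ys):
    """Model y ~ 1: predict the mean of y for every input."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Model y ~ 1 + x: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def rss(model, xs, ys):
    """Residual sum of squares of a fitted model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

def cv_error(fit, xs, ys, k=5):
    """k-fold cross-validated mean squared prediction error."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
    return total / n

# Simulated data (an assumption for illustration): y = 2 + 0.5 x + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]

m0 = fit_mean(xs, ys)   # y ~ 1
m1 = fit_line(xs, ys)   # y ~ 1 + x

# Effect size: change in model output per unit change in input.
effect_size = (m1(6.0) - m1(4.0)) / (6.0 - 4.0)

# Compare the two models by cross-validated prediction error.
cv0 = cv_error(fit_mean, xs, ys)
cv1 = cv_error(fit_line, xs, ys)

# Compare the two models with F: the larger model adds 1 coefficient
# and has 2 coefficients in total.
n = len(xs)
rss0, rss1 = rss(m0, xs, ys), rss(m1, xs, ys)
f_stat = ((rss0 - rss1) / 1) / (rss1 / (n - 2))

print(f"effect size: {effect_size:.2f}")
print(f"CV error, y ~ 1: {cv0:.2f};  y ~ 1 + x: {cv1:.2f}")
print(f"F: {f_stat:.1f}")
```

Both comparisons answer the same question, whether x earns its place in the model: cross-validation by prediction error on held-out rows, F by the drop in residual sum of squares relative to what remains.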

One app can do all these things in the space of a smartphone.

1. Create a model

2. Evaluate and find effect size

3. Compare models

4. Inference for comparing models

A Compact Guide to Classical Inference

How can we help instructors who work in a math environment where computing is deprecated and formulas are seen as the “real math”?

Being serialized at StatPREP.org

Short book for instructors showing

How to unify all the inference settings into a single method
… that doesn’t require any computation besides the app
… that builds confidence since approximate results can be seen by eye, and
exact results use 1 simple formula

Resources

StatPREP.org: Little Apps and Compact Guide
MAA mini-course in Stats for Data Science: dtkaplan.github.io/SDS-MAA-minicourse
Draft of more extensive textbook: dtkaplan.github.io/SDS-book
The prototype app in the slides: dtkaplan.shinyapps.io/LittleAppF